{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SVM on Pima Indians Diabetes Data Set\n",
    "\n",
    "数据说明：\n",
    "Pima Indians Diabetes Data Set（皮马印第安人糖尿病数据集） 根据现有的医疗信息预测5年内皮马印第安人糖尿病发作的概率。   \n",
    "\n",
    "数据集共9个字段: \n",
    "0列为怀孕次数；\n",
    "1列为口服葡萄糖耐量试验中2小时后的血浆葡萄糖浓度；\n",
    "2列为舒张压（单位:mm Hg）\n",
    "3列为三头肌皮褶厚度（单位：mm）\n",
    "4列为餐后血清胰岛素（单位:mm）\n",
    "5列为体重指数（体重（公斤）/ 身高（米）^2）\n",
    "6列为糖尿病家系作用\n",
    "7列为年龄\n",
    "8列为分类变量（0或1）\n",
    "\n",
    "数据链接：https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 首先 import 必要的模块\n",
    "import pandas as pd \n",
    "import numpy as np\n",
    "\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "#竞赛的评价指标为logloss\n",
    "from sklearn.metrics import log_loss  \n",
    "\n",
    "#SVM并不能直接输出各类的概率，所以在这个例子中我们用正确率作为模型预测性能的度量\n",
    "from sklearn.metrics import accuracy_score\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "from matplotlib import pyplot\n",
    "import seaborn as sns\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1、读取数据 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "#input data\n",
    "train = pd.read_csv(\"FE_pima-indians-diabetes.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2、数据预处理，将标签数据与特征数据分开"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "#  get labels\n",
    "y_train = train['Target']   \n",
    "X_train = train.drop([\"Target\"], axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3、模型训练\n",
    "由于速度较慢，这里只调整了超参数C，没调L1/L2正则函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.7669270833333334\n",
      "{'C': 0.01}\n"
     ]
    }
   ],
   "source": [
    "#线性SVM\n",
    "from sklearn.svm import SVC\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "\n",
    "Cs = [0.001, 0.01, 0.1, 1, 10, 100, 1000]\n",
    "param_grid = {'C': Cs}\n",
    "grid = GridSearchCV(SVC(kernel='linear'), param_grid, cv=5, n_jobs = 4)\n",
    "\n",
    "grid.fit(X_train, y_train)\n",
    "print(grid.best_score_)\n",
    "print(grid.best_params_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "大家可以对比一下用原始数据（标准化，但不做缺失值填补，即缺失值取0），效果更好。\n",
    "SVM对噪声比较敏感（或许原始数据清洗的时候已经是上帝最好的安排了:) ）"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 新特征工程后的模型训练（不做缺失值中值填补，只做标准化处理）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "#input data\n",
    "train1 = pd.read_csv(\"pima-indians-diabetes.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "#  get labels\n",
    "y_train1 = train1['Target']   \n",
    "X_train1 = train1.drop([\"Target\"], axis=1)\n",
    "\n",
    "#用于保存特征工程之后的结果\n",
    "feat_names = X_train1.columns\n",
    "\n",
    "# 数据标准化\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# 初始化特征的标准化器\n",
    "ss_X = StandardScaler()\n",
    "\n",
    "# 分别对训练和测试数据的特征进行标准化处理\n",
    "X_train1 = ss_X.fit_transform(X_train1)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "#存为csv格式\n",
    "X_train1 = pd.DataFrame(columns = feat_names, data = X_train1)\n",
    "\n",
    "train1 = pd.concat([X_train1, y_train1], axis = 1)\n",
    "\n",
    "train1.to_csv('FE_pima-indians-diabetes1.csv',index = False,header=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pregnants</th>\n",
       "      <th>Plasma_glucose_concentration</th>\n",
       "      <th>blood_pressure</th>\n",
       "      <th>Triceps_skin_fold_thickness</th>\n",
       "      <th>serum_insulin</th>\n",
       "      <th>BMI</th>\n",
       "      <th>Diabetes_pedigree_function</th>\n",
       "      <th>Age</th>\n",
       "      <th>Target</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.639947</td>\n",
       "      <td>0.848324</td>\n",
       "      <td>0.149641</td>\n",
       "      <td>0.907270</td>\n",
       "      <td>-0.692891</td>\n",
       "      <td>0.204013</td>\n",
       "      <td>0.468492</td>\n",
       "      <td>1.425995</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-0.844885</td>\n",
       "      <td>-1.123396</td>\n",
       "      <td>-0.160546</td>\n",
       "      <td>0.530902</td>\n",
       "      <td>-0.692891</td>\n",
       "      <td>-0.684422</td>\n",
       "      <td>-0.365061</td>\n",
       "      <td>-0.190672</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1.233880</td>\n",
       "      <td>1.943724</td>\n",
       "      <td>-0.263941</td>\n",
       "      <td>-1.288212</td>\n",
       "      <td>-0.692891</td>\n",
       "      <td>-1.103255</td>\n",
       "      <td>0.604397</td>\n",
       "      <td>-0.105584</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-0.844885</td>\n",
       "      <td>-0.998208</td>\n",
       "      <td>-0.160546</td>\n",
       "      <td>0.154533</td>\n",
       "      <td>0.123302</td>\n",
       "      <td>-0.494043</td>\n",
       "      <td>-0.920763</td>\n",
       "      <td>-1.041549</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-1.141852</td>\n",
       "      <td>0.504055</td>\n",
       "      <td>-1.504687</td>\n",
       "      <td>0.907270</td>\n",
       "      <td>0.765836</td>\n",
       "      <td>1.409746</td>\n",
       "      <td>5.484909</td>\n",
       "      <td>-0.020496</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   pregnants  Plasma_glucose_concentration  blood_pressure  \\\n",
       "0   0.639947                      0.848324        0.149641   \n",
       "1  -0.844885                     -1.123396       -0.160546   \n",
       "2   1.233880                      1.943724       -0.263941   \n",
       "3  -0.844885                     -0.998208       -0.160546   \n",
       "4  -1.141852                      0.504055       -1.504687   \n",
       "\n",
       "   Triceps_skin_fold_thickness  serum_insulin       BMI  \\\n",
       "0                     0.907270      -0.692891  0.204013   \n",
       "1                     0.530902      -0.692891 -0.684422   \n",
       "2                    -1.288212      -0.692891 -1.103255   \n",
       "3                     0.154533       0.123302 -0.494043   \n",
       "4                     0.907270       0.765836  1.409746   \n",
       "\n",
       "   Diabetes_pedigree_function       Age  Target  \n",
       "0                    0.468492  1.425995       1  \n",
       "1                   -0.365061 -0.190672       0  \n",
       "2                    0.604397 -0.105584       1  \n",
       "3                   -0.920763 -1.041549       0  \n",
       "4                    5.484909 -0.020496       1  "
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train1.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.7747395833333334\n",
      "{'C': 0.01}\n"
     ]
    }
   ],
   "source": [
    "#线性SVM\n",
    "from sklearn.svm import SVC\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "\n",
    "Cs = [0.001, 0.01, 0.1, 1, 10, 100, 1000]\n",
    "param_grid = {'C': Cs}\n",
    "grid = GridSearchCV(SVC(kernel='linear'), param_grid, cv=5, n_jobs = 4)\n",
    "\n",
    "grid.fit(X_train1, y_train1)\n",
    "print(grid.best_score_)\n",
    "print(grid.best_params_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 采用RBF核的超参数调优"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.76171875\n",
      "{'C': 0.1, 'gamma': 0.1}\n"
     ]
    }
   ],
   "source": [
    "#线性SVM\n",
    "from sklearn.svm import SVC\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "\n",
    "Cs = [0.001, 0.01, 0.1, 1, 10, 100, 1000]\n",
    "gamma_s = np.logspace(-1, 1, 3)\n",
    "param_grid = {'C': Cs,'gamma':gamma_s}\n",
    "grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5, n_jobs = 4)\n",
    "\n",
    "grid.fit(X_train, y_train)\n",
    "print(grid.best_score_)\n",
    "print(grid.best_params_)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
